Implement multi-row INSERT batching for PreparedStatement #944

Merged
jayantsing-db merged 12 commits into databricks:main from josecsotomorales:feature/multi-row-insert-batching on Sep 4, 2025

@josecsotomorales
Contributor

@josecsotomorales josecsotomorales commented Aug 16, 2025

Linked issue: #867

This PR implements multi-row INSERT batching optimization for prepared statements to improve performance when executing large batches of INSERT operations. The implementation combines multiple single-row INSERT statements into fewer multi-row INSERT statements while respecting Databricks' 256 parameter limit.

Adds a new InsertStatementParser utility for parsing INSERT statements and generating multi-row equivalents
Optimizes executeBatch() and executeLargeBatch() to use multi-row INSERT when possible
Includes parameter limit-aware chunking to handle large batches that exceed the 256 parameter maximum

Impact illustration (10k rows, 5 columns, 50 ms RTT):
• Before (single-row inserts): 10,000 statements → ~500s of RTT + server planning.
• After (batched): 197 statements (10k ÷ 51, rounded up) → ~9.9s of RTT.
• That’s about a 50× reduction in latency, not even counting server CPU savings.
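
The arithmetic behind this illustration can be checked with a short sketch. It assumes 51 rows fit in one statement (256 parameters ÷ 5 columns, rounded down) and counts round-trip time only; the class and method names are hypothetical and not part of the driver. Ceiling division gives 197 statements, because 196 statements of 51 rows cover only 9,996 rows.

```java
final class BatchLatencyEstimate {
    /** Statements needed when each one carries at most rowsPerStatement rows (ceiling division). */
    static long statementCount(long rows, long rowsPerStatement) {
        return (rows + rowsPerStatement - 1) / rowsPerStatement;
    }

    public static void main(String[] args) {
        int maxParameters = 256; // limit quoted in the PR description
        int columns = 5;
        long rows = 10_000;
        long rttMillis = 50;

        long rowsPerStatement = maxParameters / columns;       // integer division: 51
        long batched = statementCount(rows, rowsPerStatement); // last statement is partially filled

        System.out.printf("single-row: %d statements, ~%.1f s of RTT%n",
            rows, rows * rttMillis / 1000.0);
        System.out.printf("batched:    %d statements, ~%.2f s of RTT%n",
            batched, batched * rttMillis / 1000.0);
    }
}
```

Server-side planning and execution time are not modeled here, so the real speedup may differ from the RTT-only ratio.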

Signed-off-by: josecsotomorales josecsmorales@gmail.com, Jayant Singh jayant.singh@databricks.com

* Add INSERT statement detection with new INSERT_PATTERN regex
* Create InsertStatementParser utility for parsing INSERT statements
* Enhance DatabricksPreparedStatement.executeLargeBatch() to:
  - Detect compatible INSERT operations in batch
  - Combine multiple single-row INSERTs into multi-row INSERT
  - Generate optimized SQL like: INSERT INTO table VALUES (?), (?), (?)
  - Fall back to individual execution for non-INSERT statements
* Add comprehensive unit tests for all new functionality
* Maintain backward compatibility and proper JDBC error semantics

This addresses performance issues with Spark JDBC writes by reducing
the number of database round-trips from N individual INSERTs to 1
multi-row INSERT statement.
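
As a rough illustration of the rewrite this commit message describes, the sketch below builds a multi-row INSERT with one placeholder group per batched row. It is not the PR's InsertStatementParser; the class, the method, and the assumption that the table and column names are already known are all hypothetical.

```java
import java.util.Collections;
import java.util.List;

final class MultiRowInsertSketch {
    /**
     * Builds "INSERT INTO table (c1, c2) VALUES (?, ?), (?, ?), ..." with one
     * placeholder group per row. Hypothetical helper, for illustration only.
     */
    static String buildMultiRowInsert(String table, List<String> columns, int rowCount) {
        if (rowCount < 1 || columns.isEmpty()) {
            throw new IllegalArgumentException("need at least one row and one column");
        }
        // One "(?, ?, ...)" group sized to the column count.
        String rowGroup =
            "(" + String.join(", ", Collections.nCopies(columns.size(), "?")) + ")";
        // Repeat the group once per batched row.
        String values = String.join(", ", Collections.nCopies(rowCount, rowGroup));
        return "INSERT INTO " + table + " (" + String.join(", ", columns) + ") VALUES " + values;
    }
}
```

With a statement shaped like this, the parameters of the batched rows would be bound in row-major order: all columns of the first row, then all columns of the second, and so on.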
…ERT batching

Resolves issue where large batches exceeded Databricks' 256 parameter limit by implementing intelligent parameter chunking:

- Add MAX_QUERY_PARAMETERS constant (256) to DatabricksJdbcConstants
- Implement smart chunking logic: maxRowsPerChunk = 256 / columnCount
- Automatically split large batches into optimally-sized chunks
- Maintain multi-row INSERT performance benefits within parameter limits
- Add comprehensive tests covering chunking scenarios and edge cases
- Ensure minimum 1 row per chunk for very wide tables (>256 columns)

Example: 60 rows × 5 columns = 300 parameters (exceeds limit)
→ Automatically chunked into: 51 rows + 9 rows (255 + 45 parameters)
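
A minimal sketch of the chunking rule stated above (maxRowsPerChunk = 256 / columnCount, with at least one row per chunk). The constant name and value come from the commit message; the class and method are hypothetical and may not match the driver's actual code.

```java
import java.util.ArrayList;
import java.util.List;

final class ParameterChunkingSketch {
    static final int MAX_QUERY_PARAMETERS = 256; // value named in the commit message

    /** Returns the number of rows placed in each chunk, in execution order. */
    static List<Integer> chunkSizes(int rowCount, int columnCount) {
        if (rowCount < 0 || columnCount < 1) {
            throw new IllegalArgumentException("rowCount must be >= 0 and columnCount >= 1");
        }
        // Integer division rounds down; very wide tables still get one row per chunk.
        int maxRowsPerChunk = Math.max(1, MAX_QUERY_PARAMETERS / columnCount);
        List<Integer> sizes = new ArrayList<>();
        for (int remaining = rowCount; remaining > 0; remaining -= maxRowsPerChunk) {
            sizes.add(Math.min(remaining, maxRowsPerChunk));
        }
        return sizes;
    }
}
```

For the example in the commit message, `chunkSizes(60, 5)` yields chunks of 51 and 9 rows, which is 255 and 45 parameters.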
@jayantsing-db jayantsing-db self-requested a review August 21, 2025 18:44
@jayantsing-db jayantsing-db self-assigned this Aug 21, 2025
@jayantsing-db jayantsing-db requested a review from Copilot August 21, 2025 18:44
Contributor

Copilot AI left a comment

Pull Request Overview

This PR implements multi-row INSERT batching optimization for prepared statements to improve performance when executing large batches of INSERT operations. The implementation combines multiple single-row INSERT statements into fewer multi-row INSERT statements while respecting Databricks' 256 parameter limit.

  • Adds a new InsertStatementParser utility for parsing INSERT statements and generating multi-row equivalents
  • Optimizes executeBatch() and executeLargeBatch() to use multi-row INSERT when possible
  • Includes parameter limit-aware chunking to handle large batches that exceed the 256 parameter maximum

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/main/java/com/databricks/jdbc/common/util/InsertStatementParser.java | New utility class for parsing INSERT statements and generating multi-row batched versions |
| src/main/java/com/databricks/jdbc/common/DatabricksJdbcConstants.java | Adds INSERT pattern constant and maximum query parameters limit |
| src/main/java/com/databricks/jdbc/api/impl/DatabricksStatement.java | Adds isInsertQuery() method to detect INSERT statements |
| src/main/java/com/databricks/jdbc/api/impl/DatabricksPreparedStatement.java | Implements multi-row INSERT batching logic with parameter chunking |
| src/test/java/com/databricks/jdbc/common/util/InsertStatementParserTest.java | Comprehensive tests for INSERT statement parsing and multi-row generation |
| src/test/java/com/databricks/jdbc/api/impl/DatabricksStatementTest.java | Tests for INSERT statement detection |
| src/test/java/com/databricks/jdbc/api/impl/DatabricksPreparedStatementTest.java | Updated tests to verify multi-row batching behavior and parameter chunking |

Collaborator

@jayantsing-db jayantsing-db left a comment

I have added some comments/suggestions.

Comment thread src/main/java/com/databricks/jdbc/common/util/InsertStatementParser.java Outdated
Comment thread src/main/java/com/databricks/jdbc/common/util/InsertStatementParser.java Outdated
Comment thread src/main/java/com/databricks/jdbc/common/util/InsertStatementParser.java Outdated
Comment thread src/main/java/com/databricks/jdbc/api/impl/DatabricksPreparedStatement.java Outdated
Comment thread src/main/java/com/databricks/jdbc/api/impl/DatabricksPreparedStatement.java Outdated
Comment thread src/main/java/com/databricks/jdbc/api/impl/DatabricksPreparedStatement.java Outdated
… rollout

  - Add EnableBatchedInserts connection property for controlled rollout
  - Enhance Javadoc documentation with detailed INSERT compatibility examples
  - Replace null returns with specific DatabricksParsingException for better debugging
  - Eliminate redundant INSERT pattern validation for improved performance
  - Consolidate parsing logic to reduce code duplication
  - Add comprehensive input validation with clear error messages
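
A hedged sketch of how a client might opt in to the feature. Only the property name EnableBatchedInserts comes from this PR; the host, HTTP path, table name, and helper names are placeholders, authentication settings are omitted, and passing the property as a semicolon-separated URL parameter is an assumption based on the driver's usual URL style.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class BatchedInsertUsageSketch {
    /** Builds a JDBC URL with the feature flag appended (host and httpPath are placeholders). */
    static String jdbcUrl(String host, String httpPath, boolean enableBatchedInserts) {
        return "jdbc:databricks://" + host + ":443;httpPath=" + httpPath
            + ";EnableBatchedInserts=" + (enableBatchedInserts ? "1" : "0");
    }

    /** Queues one single-row INSERT per entry and runs them with standard JDBC batching. */
    static int[] insertAll(Connection connection, List<String[]> rows) throws SQLException {
        // demo_people is a hypothetical two-column table.
        String sql = "INSERT INTO demo_people (id, name) VALUES (?, ?)";
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            for (String[] row : rows) {
                statement.setString(1, row[0]);
                statement.setString(2, row[1]);
                statement.addBatch();
            }
            // With the flag enabled, the driver may combine these rows into multi-row INSERTs.
            return statement.executeBatch();
        }
    }
}
```

The calling code is ordinary JDBC batching either way, so the flag can be toggled without changing application logic.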
@josecsotomorales
Contributor Author

@jayantsing-db I've committed new changes to address feedback. Can you please review again?

Collaborator

@jayantsing-db jayantsing-db left a comment

Thanks for the changes. I request that the default value for the feature be set to 0 for now to avoid any accidental disruptions. Apart from that, just a few minor comments. Please feel free to merge once those are addressed.

Comment on lines +120 to +127

    if (!INSERT_PATTERN.matcher(trimmedSql).find()) {
      throw new DatabricksParsingException(
          "SQL statement is not an INSERT operation: " + trimmedSql,
          DatabricksDriverErrorCode.INPUT_VALIDATION_ERROR);
    }

    // Then extract detailed information using our specific pattern
    Matcher matcher = INSERT_DETAILS_PATTERN.matcher(trimmedSql);

Collaborator

Opinion: Consider reusing the matcher object.
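
One possible reading of this suggestion, sketched under assumptions: create a single Matcher and switch it between the two patterns with `Matcher.usePattern` instead of allocating a second matcher. The two regexes below are simplified stand-ins (the PR's real patterns are not shown on this page), and IllegalArgumentException stands in for DatabricksParsingException.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MatcherReuseSketch {
    // Simplified, hypothetical stand-ins for the driver's patterns.
    static final Pattern INSERT_PATTERN =
        Pattern.compile("^\\s*INSERT\\s+INTO\\b", Pattern.CASE_INSENSITIVE);
    static final Pattern INSERT_DETAILS_PATTERN = Pattern.compile(
        "^\\s*INSERT\\s+INTO\\s+([\\w.`]+)\\s*\\(([^)]*)\\)\\s*VALUES\\s*\\(([^)]*)\\)\\s*;?\\s*$",
        Pattern.CASE_INSENSITIVE);

    /** Returns the table name of a single-row INSERT, using one Matcher for both checks. */
    static String tableName(String sql) {
        String trimmedSql = sql.trim();
        Matcher matcher = INSERT_PATTERN.matcher(trimmedSql);
        if (!matcher.find()) {
            throw new IllegalArgumentException(
                "SQL statement is not an INSERT operation: " + trimmedSql);
        }
        // Swap in the detailed pattern and rewind the same matcher to the start of the input.
        matcher.usePattern(INSERT_DETAILS_PATTERN).reset();
        if (!matcher.matches()) {
            throw new IllegalArgumentException("Unsupported INSERT shape: " + trimmedSql);
        }
        return matcher.group(1);
    }
}
```

Whether this is what the reviewer had in mind is not stated; the later commit note about eliminating redundant INSERT pattern validation suggests the first check may simply have been dropped instead.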

Comment thread src/main/java/com/databricks/jdbc/common/DatabricksJdbcUrlParams.java Outdated
@jayantsing-db
Collaborator

Gentle reminder (you may already be aware): please sign off the final commit to main. For more info, please take a look at https://github.com/databricks/databricks-jdbc/blob/main/CONTRIBUTING.md

- Changed ENABLE_BATCHED_INSERTS default value from "1" to "0" in DatabricksJdbcUrlParams
- Updated batch statement tests to explicitly enable EnableBatchedInserts=1 for proper testing
- Added lenient mocking to prevent unnecessary stubbing exceptions in test cases
- This ensures batched inserts are disabled by default while maintaining test coverage

Signed-off-by: josecsotomorales <josecsmorales@gmail.com>
@josecsotomorales
Contributor Author

> Gentle reminder (you may already be aware): please sign off the final commit to main. For more info, please take a look at https://github.com/databricks/databricks-jdbc/blob/main/CONTRIBUTING.md

@jayantsing-db, I've addressed all the requested changes and signed off on my commit. Even though the PR is approved, I'm unable to merge it due to the lack of permissions. Could you please merge it?

@jayantsing-db jayantsing-db merged commit df447ec into databricks:main Sep 4, 2025
12 of 13 checks passed
@jayantsing-db
Collaborator

Hey @josecsotomorales, I just came across this post: https://qualytics.ai/blog/qualytics-databricks-partnership/. Curious whether the integration is using the OSS JDBC driver?

@josecsotomorales
Contributor Author

> Hey @josecsotomorales, I just came across this post: https://qualytics.ai/blog/qualytics-databricks-partnership/. Curious whether the integration is using the OSS JDBC driver?

Hi @jayantsing-db, yep! We support two modes today.

Standard Connector: uses the Databricks JDBC driver for broad compatibility across environments. Thanks again for accepting our contributions — that helped a ton on our side! 🚀

Unity Catalog Mode: more Spark-native. We do direct Spark reads against Unity Catalog–managed tables, which avoids JDBC, integrates cleanly with UC permissions, and performs better at scale.

@jayantsing-db
Collaborator

Great, thanks and congratulations on the launch!

3 participants